This notebook will contain the code needed to execute our data analysis project and answer the questions we would like to ask of the Spotify and YouTube data from Kaggle.

Load the libraries

library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.3.2 ──✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(lubridate)

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union
library(janitor)

Attaching package: ‘janitor’

The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test

Load and clean data

spotify_youtube <- read_csv("data/Spotify_Youtube.csv") %>%
clean_names() %>%
rename(number = x1) %>%
  select(-c(number)) %>%
mutate(duration_secs = duration_ms/1000, duration_mins = duration_ms/60000)
New names:Rows: 20718 Columns: 28── Column specification ─────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Artist, Url_spotify, Track, Album, Album_type, Uri, Url_youtube, Title, Channel, De...
dbl (16): ...1, Danceability, Energy, Key, Loudness, Speechiness, Acousticness, Instrumentaln...
lgl  (2): Licensed, official_video
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(spotify_youtube)
Rows: 20,718
Columns: 29
$ artist           <chr> "Gorillaz", "Gorillaz", "Gorillaz", "Gorillaz", "Gorillaz", "Gorillaz"…
$ url_spotify      <chr> "https://open.spotify.com/artist/3AA28KZvwAUcZuOKwyblJQ", "https://ope…
$ track            <chr> "Feel Good Inc.", "Rhinestone Eyes", "New Gold (feat. Tame Impala and …
$ album            <chr> "Demon Days", "Plastic Beach", "New Gold (feat. Tame Impala and Bootie…
$ album_type       <chr> "album", "album", "single", "album", "album", "album", "single", "albu…
$ uri              <chr> "spotify:track:0d28khcov6AiegSCpG5TuT", "spotify:track:1foMv2HQwfQ2vnt…
$ danceability     <dbl> 0.818, 0.676, 0.695, 0.689, 0.663, 0.760, 0.716, 0.726, 0.741, 0.625, …
$ energy           <dbl> 0.705, 0.703, 0.923, 0.739, 0.694, 0.891, 0.897, 0.815, 0.913, 0.877, …
$ key              <dbl> 6, 8, 1, 2, 10, 11, 4, 11, 2, 10, 9, 4, 9, 0, 0, 11, 0, 0, 7, 5, 6, 11…
$ loudness         <dbl> -6.679, -5.815, -3.930, -5.810, -8.627, -5.852, -7.185, -5.886, -3.340…
$ speechiness      <dbl> 0.1770, 0.0302, 0.0522, 0.0260, 0.1710, 0.0372, 0.0629, 0.0313, 0.0465…
$ acousticness     <dbl> 8.36e-03, 8.69e-02, 4.25e-02, 1.51e-05, 2.53e-02, 2.29e-02, 1.20e-02, …
$ instrumentalness <dbl> 2.33e-03, 6.87e-04, 4.69e-02, 5.09e-01, 0.00e+00, 8.69e-02, 2.62e-01, …
$ liveness         <dbl> 0.6130, 0.0463, 0.1160, 0.0640, 0.0698, 0.2980, 0.3250, 0.1120, 0.3250…
$ valence          <dbl> 0.772, 0.852, 0.551, 0.578, 0.525, 0.966, 0.358, 0.462, 0.643, 0.865, …
$ tempo            <dbl> 138.559, 92.761, 108.014, 120.423, 167.953, 120.264, 127.030, 140.158,…
$ duration_ms      <dbl> 222640, 200173, 215150, 233867, 340920, 245000, 274142, 209560, 213750…
$ url_youtube      <chr> "https://www.youtube.com/watch?v=HyHNuVaZJ-k", "https://www.youtube.co…
$ title            <chr> "Gorillaz - Feel Good Inc. (Official Video)", "Gorillaz - Rhinestone E…
$ channel          <chr> "Gorillaz", "Gorillaz", "Gorillaz", "Gorillaz", "Gorillaz", "Gorillaz"…
$ views            <dbl> 693555221, 72011645, 8435055, 211754952, 618480958, 259021161, 451996,…
$ likes            <dbl> 6220896, 1079128, 282142, 1788577, 6197318, 1844658, 11686, 17675, 739…
$ comments         <dbl> 169907, 31003, 7399, 55229, 155930, 72008, 241, 260, 20296, 39240, 121…
$ description      <chr> "Official HD Video for Gorillaz' fantastic track Feel Good Inc.\n\nFol…
$ licensed         <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TR…
$ official_video   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRU…
$ stream           <dbl> 1040234854, 310083733, 63063467, 434663559, 617259738, 323850327, 1066…
$ duration_secs    <dbl> 222.640, 200.173, 215.150, 233.867, 340.920, 245.000, 274.142, 209.560…
$ duration_mins    <dbl> 3.710667, 3.336217, 3.585833, 3.897783, 5.682000, 4.083333, 4.569033, …

Basic exploratory analysis

** The dataset has 20,718 rows and 28 columns. There are some NA values within the YouTube data which could serve as a limitation. Similarly, within the YouTube description values, some of the descriptions have emojis or other characters and symbols that could be difficult to work with. Another limitation could arise with the values in the instrumentalness column since they include negative exponents which could also be difficult to work with for different analysis calculations. The original source of the data defines the columns well, otherwise we may make them more complicated. Since the data includes artists whose music is on Spotify but probably not every artist in the world, we would not be able to make assumptions about the music industry as a whole. Another limitation we can notice is that some of the songs that fall under an artist’s most popular songs are feautres of said artist on another song. This can be kind of confusing, but we might be able to work around it using filters once we start more analysis. However, this data is also helpful for answering our question about how collaborations affect an artist’s popularity. One code we would need to make note of is ‘key,’ which denotes pitch notation but we need to find a way to make it easily apparent what the pitch is rather than just seeing a number. There could also be issues with repeat songs when songs are on more than one album, although sometimes it will be the same song but a slightly different rendition.

##Questions Question 1: Which attributes, such as danceability, energy, loudness, etc., tend to have a correlation with the most streamed songs?

Analysis: After using code to select the variables we wanted to work with, we found the correlation coefficient for each song attribute in relation to the number of streams for each song. After finding the correlation coefficients, we calculated the mean among the coefficients for each attribute to average it out. After that, we looked for the maximum value among all the means, and found that danceability and streams had the highest positive correlation coefficient of about 0.073. There were some previous errors due to some values being NA, but using the filter for complete cases got rid of them, and the data we did use was so large that it provided a general idea of which attribute had the greatest correlation to number of streams.

attribute_correlate_stream <- spotify_youtube %>% 
  select(danceability, energy, key, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, stream) %>% 
  filter(complete.cases(.)) %>%
## find the correlation coefficient between each attribute and number of spotify streams 
   mutate(
    dance_cor = cor(danceability, stream),
    energy_cor = cor(energy, stream),
    key_cor = cor(key, stream),
    speech_cor = cor(speechiness, stream),
    acoustic_cor = cor(acousticness, stream),
    instrumental_cor = cor(instrumentalness, stream),
    live_cor = cor(liveness, stream),
    valence_cor = cor(valence, stream),
    tempo_cor = cor(tempo, stream)
  ) %>% 
## find the mean correlation coefficient for each attribute
  summarise(
    mean_dance_cor = mean(dance_cor),
    mean_energy_cor = mean(energy_cor),
    mean_key_cor = mean(key_cor),
    mean_speech_cor = mean(speech_cor),
    mean_acoustic_cor = mean(acoustic_cor),
    mean_instrumental_cor = mean(instrumental_cor),
    mean_live_cor = mean(live_cor),
    mean_valence_cor = mean(valence_cor),
    mean_tempo_cor = mean(tempo_cor)
  )
##output the highest value among the mean coefficients
max_value <- apply(attribute_correlate_stream, 1, max)
  

Question 2: Does higher engagement on YouTube videos lead to more streams of the song from the video on Spotify? Is there a relationship that exists between social engagement and streams?

Analysis: In order to find and map out the correlation between Youtube video views and Spotify streams, we first grouped the data by artist and then summarised the data by the sum of views and sum of streams for each artist. After that, we plotted the data into a scatter plot with a line of best fit, which shows that there is a slight positive correlation between number of views on Youtube and number of streams on Spotify. This means that it can be generally true that as one value increases, so does the other. While doing this, we divided the totals by 1,000,000 because the numbers would have been to large to easily grasp.

##code to find the total number of views and streams (divided by 1,000,000 because numbers were too large to work with)
streams_views <- spotify_youtube %>% 
  select(artist, track, stream, views) %>% 
  mutate(difference = abs(stream-views)) %>% 
  group_by(artist) %>% 
  summarise(
    total_views = sum(views)/1000000,
    total_streams = sum(stream)/1000000,
  )
##create a scatterplot with line of best fit
streams_views %>% 
  ggplot(aes(x=total_streams, y=total_views))+
  geom_point(size=2)+
  theme_minimal()+
  geom_smooth(method = "lm")+
  labs(
    title = "Correlation between an artist's Spotify Streams and Youtube Views",
    x = "Number of Streams on Spotify",
    y = "Number of Views on Youtube"
  )

Question 3: How many videos with a high number of streams are coming from licensed content?

Analysis:

Question 4: How do collaborations or features on a song affect its popularity on Spotify and Youtube? What are the most popular collaborations?

Analysis:

---
title: "Data Analysis Project"
names: "Kiersten Hacker and Sherwin-Nestor Esguerra"
date: "4-11-2023"
output: html_notebook
---

This notebook will contain the code needed to execute our data analysis project and answer the questions we would like to ask of the Spotify and YouTube data from Kaggle.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

### Load the libraries
```{r}
library(tidyverse)
library(lubridate)
library(janitor)
```

### Load and clean data
```{r}
spotify_youtube <- read_csv("data/Spotify_Youtube.csv") %>%
clean_names() %>%
rename(number = x1) %>%
  select(-c(number)) %>%
mutate(duration_secs = duration_ms/1000, duration_mins = duration_ms/60000)

glimpse(spotify_youtube)


```

### Basic exploratory analysis
** The dataset has 20,718 rows and 28 columns. There are some NA values within the YouTube data which could serve as a limitation. Similarly, within the YouTube description values, some of the descriptions have emojis or other characters and symbols that could be difficult to work with. Another limitation could arise with the values in the instrumentalness column since they include negative exponents which could also be difficult to work with for different analysis calculations. The original source of the data defines the columns well, otherwise we may make them more complicated. Since the data includes artists whose music is on Spotify but probably not every artist in the world, we would not be able to make assumptions about the music industry as a whole. Another limitation we can notice is that some of the songs that fall under an artist's most popular songs are feautres of said artist on another song. This can be kind of confusing, but we might be able to work around it using filters once we start more analysis. However, this data is also helpful for answering our question about how collaborations affect an artist's popularity. One code we would need to make note of is 'key,' which denotes pitch notation but we need to find a way to make it easily apparent what the pitch is rather than just seeing a number. There could also be issues with repeat songs when songs are on more than one album, although sometimes it will be the same song but a slightly different rendition.

##Questions
Question 1: Which attributes, such as danceability, energy, loudness, etc., tend to have a correlation with the most streamed songs?

Analysis: After using code to select the variables we wanted to work with, we found the correlation coefficient for each song attribute in relation to the number of streams for each song. After finding the correlation coefficients, we calculated the mean among the coefficients for each attribute to average it out. After that, we looked for the maximum value among all the means, and found that danceability and streams had the highest positive correlation coefficient of about 0.073. There were some previous errors due to some values being NA, but using the filter for complete cases got rid of them, and the data we did use was so large that it provided a general idea of which attribute had the greatest correlation to number of streams.

```{r}
attribute_correlate_stream <- spotify_youtube %>% 
  select(danceability, energy, key, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, stream) %>% 
  filter(complete.cases(.)) %>%
## find the correlation coefficient between each attribute and number of spotify streams 
   mutate(
    dance_cor = cor(danceability, stream),
    energy_cor = cor(energy, stream),
    key_cor = cor(key, stream),
    speech_cor = cor(speechiness, stream),
    acoustic_cor = cor(acousticness, stream),
    instrumental_cor = cor(instrumentalness, stream),
    live_cor = cor(liveness, stream),
    valence_cor = cor(valence, stream),
    tempo_cor = cor(tempo, stream)
  ) %>% 
## find the mean correlation coefficient for each attribute
  summarise(
    mean_dance_cor = mean(dance_cor),
    mean_energy_cor = mean(energy_cor),
    mean_key_cor = mean(key_cor),
    mean_speech_cor = mean(speech_cor),
    mean_acoustic_cor = mean(acoustic_cor),
    mean_instrumental_cor = mean(instrumental_cor),
    mean_live_cor = mean(live_cor),
    mean_valence_cor = mean(valence_cor),
    mean_tempo_cor = mean(tempo_cor)
  )
##output the highest value among the mean coefficients
max_value <- apply(attribute_correlate_stream, 1, max)
  
```

Question 2: Does higher engagement on YouTube videos lead to more streams of the song from the video on Spotify? Is there a relationship that exists between social engagement and streams?

Analysis: In order to find and map out the correlation between Youtube video views and Spotify streams, we first grouped the data by artist and then summarised the data by the sum of views and sum of streams for each artist. After that, we plotted the data into a scatter plot with a line of best fit, which shows that there is a slight positive correlation between number of views on Youtube and number of streams on Spotify. This means that it can be generally true that as one value increases, so does the other. While doing this, we divided the totals by 1,000,000 because the numbers would have been to large to easily grasp. 

```{r}
##code to find the total number of views and streams (divided by 1,000,000 because numbers were too large to work with)
streams_views <- spotify_youtube %>% 
  select(artist, track, stream, views) %>% 
  mutate(difference = abs(stream-views)) %>% 
  group_by(artist) %>% 
  summarise(
    total_views = sum(views)/1000000,
    total_streams = sum(stream)/1000000,
  )
##create a scatterplot with line of best fit
streams_views %>% 
  ggplot(aes(x=total_streams, y=total_views))+
  geom_point(size=2)+
  theme_minimal()+
  geom_smooth(method = "lm")+
  labs(
    title = "Correlation between an artist's Spotify Streams and Youtube Views",
    x = "Number of Streams on Spotify",
    y = "Number of Views on Youtube"
  )
```

Question 3: How many videos with a high number of streams are coming from licensed content?

Analysis: 

```{r}
youtube_high_views <- spotify_youtube %>%
  filter(views > 500000000)

youtube_high_views %>%
  group_by(artist) %>%
  summarise (
    count_licensed = n()
  ) %>%
  arrange(desc(count_licensed))

youtube_high_views %>%
  filter(licensed == FALSE) 

youtube_high_views %>%
  group_by(channel) %>%
  filter(licensed == FALSE, official_video == TRUE)

##how to get totals for licensed and unlicensed?
```

Question 4: How do collaborations or features on a song affect its popularity on Spotify and Youtube? What are the most popular collaborations?

Analysis:

```{r}
song_features <- spotify_youtube %>%
  filter(str_detect(track, 'feat.'))

song_features %>%
  group_by(artist, stream) %>%
  arrange(desc(stream))

##How to count how many features an artist has, issue of repeats where the same song is listed under more than one artist (Mackelmore and Ryan Lewis)

```

